GradQuant: Simulation & Cross-Validation

Simulation

Why simulate data?

  • Testing statistical methods: Simulation allows you to test statistical methods on data with known properties, which can help you evaluate the performance of the methods and make modifications as necessary.
  • Generating data for hypothesis testing: When real-world data is limited or difficult to collect, you can simulate data that follows the same distribution as the real-world data to perform hypothesis testing and evaluate the significance of your findings.
  • Assessing model performance: Simulation can help you evaluate the performance of statistical models by generating data that follows a known distribution and comparing the output of the model to the true values of the simulated data.

Why simulate data (continued)?

  • Exploring parameter space: Simulation can allow you to explore the behavior of a system or model over a range of parameter values, which can help you understand how the system or model responds to different conditions.
  • Designing experiments: Simulation can help you design experiments by allowing you to test different scenarios and determine the best way to collect data to answer your research questions.
  • Teaching statistics concepts: Simulation can be a useful tool for teaching statistics concepts by allowing students to explore different distributions and statistical methods in a controlled and interactive environment.

Steps to Simulation

  1. Simulate quantitative variables via random number generation with rnorm(), runif() and rpois().
  2. Generate character variables that represent groups via rep(). We’ll explore how to create character vectors with different repeating patterns.
  3. Create data with both quantitative and categorical variables, making use of functions from the first two steps above.
  4. Learn to use replicate() to repeat the data simulation process many times.
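As a preview, the four steps above can be sketched in a few lines (the sample sizes, group names, and distribution parameters here are arbitrary):

```r
set.seed(42)                               # make the simulation reproducible
n <- 6                                     # observations per group

# Step 1: simulate a quantitative variable via random number generation
scores <- rnorm(n * 2, mean = 50, sd = 10)

# Step 2: generate a character variable representing groups via rep()
group <- rep(c("control", "treatment"), each = n)

# Step 3: combine the quantitative and categorical variables
dat <- data.frame(group = group, score = scores)

# Step 4: repeat a simulation many times with replicate()
sims <- replicate(100, mean(rnorm(n, mean = 50, sd = 10)))
```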

Generating Random Numbers

An easy way to generate numeric data is to pull random numbers from some distribution. This can be done via the functions for generating random deviates. These functions always start with r (for “random”).

The basic distributions that I use the most for generating random numbers are the normal (rnorm()) and uniform (runif()) distributions. We’ll look at those today, plus the Poisson (rpois()) distribution for generating discrete counts.

Generating Random Numbers Example: rnorm

rnorm(500, mean = 5, sd = 2.5) # rnorm(n, mean = 0, sd = 1)
  [1]  4.0744220 -0.6631113 -1.4938442  1.0327910  3.3020188  9.5263304
  [7]  6.1526249  6.9004198  6.2029254  1.6606358  9.3564289  5.0936868
 [13]  5.3690228  5.6417966  7.2479434  0.6392558  5.8326217  3.3999794
  ... (output truncated; 500 values in total)
[499]  4.8306440 10.1301625

Generating Random Numbers Example: runif

runif(75, min = 50, max = 80) # runif(n, min = 0, max = 1)
 [1] 59.24369 60.70235 53.49217 60.12058 54.24428 60.01833 56.11951 55.02700
 [9] 78.76907 52.15557 73.89650 69.31719 73.76328 79.76175 76.95515 62.50981
[17] 51.60499 56.33447 69.55696 58.12685 57.08393 58.97292 72.99054 77.69095
[25] 63.82595 75.67034 63.02874 76.11186 75.06981 73.85850 67.16961 62.40099
[33] 54.47356 53.23395 69.84990 77.37071 61.92546 50.77723 58.14153 67.99742
[41] 78.58527 54.79227 75.92748 69.06525 67.46828 61.96356 61.37053 62.79848
[49] 66.64429 71.08993 57.77655 72.46413 73.09247 56.49777 64.80713 73.39988
[57] 61.73066 57.55334 72.12048 74.66377 62.77624 72.92085 62.35096 50.30537
[65] 52.50197 59.96953 50.48017 76.46012 66.45100 54.99897 59.40064 56.91190
[73] 66.41967 58.92934 66.41567

Generating Random Numbers Example: rpois

rpois(1000, lambda = 1.7) # rpois(n, lambda)
   [1] 1 1 5 3 4 0 5 5 5 1 0 3 1 0 6 0 4 3 0 2 2 4 1 2 0 4 3 2 2 4 3 2 2 2 1 2 0
  [38] 3 2 0 0 0 5 2 3 0 1 2 1 4 3 1 1 2 3 0 4 1 4 4 4 1 2 3 2 2 2 4 2 3 2 3 0 1
  ... (output truncated; 1000 values in total)
[1000] 2

There are many more!

Exponential, gamma, beta, binomial… These will serve as the backbone of all of your data simulations.
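A quick taste of those generators (all parameter values below are arbitrary):

```r
set.seed(1)
rexp(5, rate = 2)                 # exponential: positive, right-skewed
rgamma(5, shape = 2, rate = 1)    # gamma: positive, flexible skew
rbeta(5, shape1 = 2, shape2 = 5)  # beta: bounded between 0 and 1
rbinom(5, size = 10, prob = 0.5)  # binomial: successes out of 10 trials
```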

Central Limit Theorem Example

Create a non-normal ‘population’ distribution

library(ggplot2)
## generate the population distribution from an exponential distribution
## (1e8 draws occupies roughly 800 MB; reduce if memory is tight)
population <- rexp(1e8, rate = 3.2)
qplot(population)

Create a function for simulating Central Limit Theorem

CLT <- function(pop, iter, sampleSize = 100){
  library(ggplot2) ## load ggplot2 for the function
  samplematrix <- matrix(ncol = 1, nrow = iter) ## initialize empty matrix
  for(i in 1:iter){
    ## get a sample of size sampleSize from the population
    study <- sample(pop, sampleSize, replace = FALSE)
    ## store its mean in the matrix
    samplematrix[i, ] <- mean(study)
  }
  ## plot the standardized sample means; as iter grows, the distribution
  ## converges (i.e., has a limit) on Normal(0, 1)
  return( qplot(scale(samplematrix)) )
}

Use the functions at three different Ns

## number of iterations
iter <- 100000
p1 <- CLT(population, iter) + ggtitle(paste0("N = " , iter))
iter <- 1000
p2 <- CLT(population, iter) + ggtitle(paste0("N = ", iter))
iter <- 100
p3 <- CLT(population, iter) + ggtitle(paste0("N = ", iter))
library(patchwork)
p1 + p2 + p3

Simulate data

set.seed(123)
N <- 100 # 100 observations per group
group1 <- rnorm(N, mean = 0, sd = 1)
group2 <- rnorm(N, mean = 0, sd = 1)
group3 <- rnorm(N, mean = 0, sd = 1)
data <- data.frame(value = c(group1, group2, group3),
                   group = factor(rep(1:3, each = N))) # each label repeats N times
head(data, 10) # inspect the first rows
          value group
1  -0.560475647     1
2  -0.230177489     1
3   1.558708314     1
4   0.070508391     1
5   0.129287735     1
6   1.715064987     1
7   0.460916206     1
8  -1.265061235     1
9  -0.686852852     1
10 -0.445661970     1

Conduct false positive rate test for ANOVA

library(ez)
n_simulations <- 1000
alpha <- 0.05
false_positive_rate <- numeric(n_simulations)
for (i in 1:n_simulations) {
  group1 <- rnorm(N, mean = 0, sd = 1)
  group2 <- rnorm(N, mean = 0, sd = 1)
  group3 <- rnorm(N, mean = 0, sd = 1)
  
  data <- data.frame(value = c(group1, group2, group3),
                     group = factor(rep(1:3, each = N)))
  data$subID <- as.factor(1:nrow(data))
  
  model <- ezANOVA(data = data, dv = value, between = group, wid = subID)
  Fval <- model$ANOVA$F # extract the F statistic
  p <- model$ANOVA$p    # extract the p value
  
  false_positive_rate[i] <- as.numeric(p < alpha)
}
mean(false_positive_rate)
[1] 0.046

Simulate a regression model

set.seed(20) # for reproducibility
simLM <- function(){
  N <- 100 # sample size
  x <- rnorm(N) # simulate the predictor variable
  e <- rnorm(N, 0, 2) # simulate the error term
  y <- 0.5 + 2 * x + e # compute the outcome via the model
  summary(lm(y ~ x)) # fit the model
}
simLM()

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9170 -1.3303  0.1328  1.5261  3.6446 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6777     0.1983   3.417 0.000922 ***
x             2.3596     0.2013  11.719  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.983 on 98 degrees of freedom
Multiple R-squared:  0.5836,    Adjusted R-squared:  0.5793 
F-statistic: 137.3 on 1 and 98 DF,  p-value: < 2.2e-16

Simulate a regression model multiple times:

We can replicate this regression model many times; on average, the slope estimates will cluster around 2.0, the true value we built into the model. The fact that any single estimate is not exactly 2.0 is a good way to understand sampling error.

repLM <- replicate(500, simLM()$coefficients[2]) # [2] is the slope estimate
qplot(repLM)

Cross Validation

Overfitting

The tendency for statistical models to mistakenly fit sample-specific noise as if it were signal is commonly referred to as overfitting. Minimizing overfitting when training statistical models can be seen as one of the primary objectives of the field of machine learning.

[Figure: illustrations of underfitting vs. overfitting and techniques for handling them]

We care about new observations…

We generally do not care very much about how well we can predict scores for the observations in our existing sample, since we already know what those scores are.

In this sense, the prediction error that we compute when we fit a model on a particular dataset is only a proxy for the quantity we truly care about: the error we would obtain if we applied our trained model to an entirely new set of observations sampled from the same population (the test error).
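To see this concretely, we can deliberately overfit: fit a high-degree polynomial to a small training sample and score it on a fresh test sample from the same population (the degree, sample size, and seed below are arbitrary choices):

```r
set.seed(99)
n <- 30
x_train <- runif(n); y_train <- 2 * x_train + rnorm(n)
x_test  <- runif(n); y_test  <- 2 * x_test  + rnorm(n)

# A very flexible model chases noise in the training data
fit <- lm(y_train ~ poly(x_train, 10))

train_rmse <- sqrt(mean(residuals(fit)^2))
preds <- predict(fit, newdata = data.frame(x_train = x_test))
test_rmse <- sqrt(mean((y_test - preds)^2))

# Training error flatters the model; the test error is what we actually
# care about, and it will typically be noticeably larger here
c(train = train_rmse, test = test_rmse)
```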

When is overfitting worst?

  • A large number of predictors

  • Multicollinearity

  • Small N

  • Small effects

What Does Cross-Validation Mean?

  • Cross-validation is a statistical approach for determining how well the results of a statistical investigation generalize to a different data set.
  • Cross-validation is commonly employed in situations where the goal is prediction and the accuracy of a predictive model’s performance must be estimated.
  • Simply put, a model is trained on one dataset and then tested on a completely independent dataset.

Repeated Replication

One can turn a single dataset into two nominally independent datasets rather easily: simply split the original dataset at random into a training dataset and a test dataset.

The training half is used to fit the model, and the test half is subsequently used to quantify the test error of the trained model. The estimate from the test dataset will then more closely approximate the true out-of-sample performance of the model.
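A minimal version of this split in base R, using simulated data (the 50/50 split proportion is an arbitrary choice):

```r
set.seed(7)
dat <- data.frame(x = rnorm(100))
dat$y <- 1.5 * dat$x + rnorm(100)

# Randomly assign half of the rows to the training set
train_rows <- sample(nrow(dat), size = nrow(dat) / 2)
train_dat <- dat[train_rows, ]
test_dat  <- dat[-train_rows, ]

fit <- lm(y ~ x, data = train_dat)                # fit on the training half
preds <- predict(fit, newdata = test_dat)         # predict the held-out half
test_rmse <- sqrt(mean((test_dat$y - preds)^2))   # out-of-sample error
```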

One can then repeat this cross-validation across "folds".

10-Fold Cross-Validation

The data are divided into 10 equally sized folds; each fold serves once as the test set while the model is trained on the remaining nine, and the 10 error estimates are averaged.

Leave-One-Out Cross-Validation

When K is equal to the sample size n—so that the model is fit n times, each time predicting the score for only a single held-out subject—the procedure is commonly referred to as leave-one-out cross-validation (LOOCV).

Computationally expensive, so it is usually only recommended when the dataset is small.
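A manual sketch of LOOCV for a simple regression on simulated data (an automated caret version appears later):

```r
set.seed(11)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

errors <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ x, data = dat[-i, ])        # train on all rows except row i
  pred <- predict(fit, newdata = dat[i, ])  # predict the single held-out row
  errors[i] <- dat$y[i] - pred
}
loocv_rmse <- sqrt(mean(errors^2))  # aggregate over the n held-out predictions
```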

Let’s get started on CV…

Load Data

# if(!require("lars")){
#   install.packages("lars")
#   library("lars")
# }
library(lars)
data("diabetes")

Inspect Data

X <- diabetes$x
(N <- nrow(X))
[1] 442

Fit Regression Model

y <- diabetes$y # outcome
X <- diabetes$x # all predictors
reg <- lm(y ~ X)
summary(reg)

Call:
lm(formula = y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-155.829  -38.534   -0.227   37.806  151.355 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  152.133      2.576  59.061  < 2e-16 ***
Xage         -10.012     59.749  -0.168 0.867000    
Xsex        -239.819     61.222  -3.917 0.000104 ***
Xbmi         519.840     66.534   7.813 4.30e-14 ***
Xmap         324.390     65.422   4.958 1.02e-06 ***
Xtc         -792.184    416.684  -1.901 0.057947 .  
Xldl         476.746    339.035   1.406 0.160389    
Xhdl         101.045    212.533   0.475 0.634721    
Xtch         177.064    161.476   1.097 0.273456    
Xltg         751.279    171.902   4.370 1.56e-05 ***
Xglu          67.625     65.984   1.025 0.305998    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.15 on 431 degrees of freedom
Multiple R-squared:  0.5177,    Adjusted R-squared:  0.5066 
F-statistic: 46.27 on 10 and 431 DF,  p-value: < 2.2e-16

Shuffle the Data

set.seed(12345) # random seed for reproducibility
new_indices <- sample(nrow(X)) # random permutation of the row indices
y <- y[new_indices]
X <- X[new_indices,]

Create Fold Indices

K <- 10
breaks <- round(quantile(seq(N), probs = seq(0, 1, 1/K))) # cut points for 10 blocks
groups <- cut(seq(N), breaks = breaks, include.lowest = TRUE)
indices <- split(seq(N), groups) # split into 10 groups of roughly equal size

Manually 10-Fold Cross-Validate

# empty vector to hold results
RMSE <- numeric(K)
df <- as.data.frame(cbind(y, X))
# for each fold...
for(i in 1:K){
  # create train folds
  trainDf <- df[-indices[[i]],]
  # create test fold
  testDf <- df[indices[[i]],]
  # predict Y from all of X
  mod <- lm(y ~ ., data = trainDf)
  # generate predictions
  predictions <- predict(mod, testDf)
  # compute and save RMSE
  errors <- testDf$y - predictions
  RMSE[i] <- sqrt(mean(errors^2))
}

What is our RMSE?

mean(RMSE)
[1] 54.58196
sd(RMSE)
[1] 4.912106

Isn’t there an easier way?

Automated Partitioning using ‘caret’ package

library(caret)
set.seed(123) # for reproducibility
# creating training data as 80% of the dataset
random_sample <- createDataPartition(df$y, p = 0.8, list = FALSE)
trainDf  <- df[random_sample, ] # generating training dataset
testDf <- df[-random_sample, ] # generating testing dataset
model <- lm(y ~., data = trainDf) # training the model
# predicting the target variable
predictions <- predict(model, testDf)
# computing model performance metrics
data.frame( R2 = R2(predictions, testDf$y),
            RMSE = RMSE(predictions, testDf$y) )
         R2     RMSE
1 0.4975122 53.36445

Metrics

  • R2 is typically interpreted as the proportion of the variance in the outcome variable (e.g., task performance) that can be accounted for by the predictors (e.g., arousal level).
    • When calculated by squaring a correlation coefficient, R2 captures the extent to which predicted values exhibit the same rank order as observed values, providing a relative measure of model fit.
  • RMSE is the square root of the mean squared residual and indicates how much, on average, predicted values deviate from observed values.
    • Because it is expressed in the units of the outcome, RMSE provides an absolute measure of model fit.
  • Since R2 and RMSE focus on different aspects of model fit, it is sensible to report both metrics.
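Both metrics are easy to compute directly from observed and predicted values (the vectors below are made-up toy numbers):

```r
observed  <- c(3.0, 5.5, 4.2, 6.1, 5.0)
predicted <- c(2.8, 5.9, 4.0, 5.8, 5.2)

# R2 as the squared correlation between predicted and observed (relative fit)
r2 <- cor(predicted, observed)^2

# RMSE: square root of the mean squared residual (absolute fit,
# expressed in the units of the outcome)
rmse <- sqrt(mean((observed - predicted)^2))
```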

Leave-One-Out CV using ‘caret’

# as Leave One Out Cross Validation
train_control <- trainControl(method = "LOOCV")
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(y ~., data = df,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)
Linear Regression 

442 samples
 10 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 441, 441, 441, 441, 441, 441, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  54.78819  0.4940917  44.35566

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • Less biased estimates of test error, since nearly every data point is used for training in each fit.

  • No randomness in the performance metrics: LOOCV involves no random splitting, so running it repeatedly on the same dataset gives identical results.

Disadvantages:

  • Training the model N times leads to expensive computation time if the dataset is large.

K-Fold CV using ‘caret’

set.seed(125) # for reproducibility
train_control <- trainControl(method = "cv",
                              number = 10) # 10-fold CV
# train model
model <- train(y ~., data = df,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)
Linear Regression 

442 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 397, 398, 398, 398, 397, 398, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  54.69294  0.499584  44.55305

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • Fast computation speed.

  • A very effective method to estimate the prediction error and the accuracy of a model.

Disadvantages:

  • A lower value of K yields a more biased estimate of the prediction error, while a higher value of K increases the variability of the performance metrics. It is therefore important to choose K sensibly (generally K = 5 or K = 10 is desirable).

Repeated K-Fold CV using ‘caret’

set.seed(125) # reproducibility
# 10 folds, repeating 3 times
train_control <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 3)
# train model
model <- train(y ~ ., data = df,
               method = "lm",
               trControl = train_control)
print(model)
Linear Regression 

442 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 397, 398, 398, 398, 397, 398, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  54.49432  0.5039158  44.34722

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • In each repetition the data sample is reshuffled, producing different splits of the sample data and therefore a more stable estimate of performance.

Disadvantages:

  • With each repetition the algorithm has to train the model from scratch, so the computation time to evaluate the model grows in proportion to the number of repetitions.

Other forms of CV

  • Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

  • Nested: This is where k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested CV or double cross-validation.
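A base-R sketch of stratified fold assignment (caret's createFolds() does this automatically when given a factor outcome); the class sizes and K here are arbitrary:

```r
set.seed(42)
# An imbalanced categorical outcome: 80 "A"s and 20 "B"s
class <- factor(rep(c("A", "B"), times = c(80, 20)))

K <- 5
fold <- integer(length(class))
for (lev in levels(class)) {
  rows <- sample(which(class == lev))       # shuffle this class's rows
  fold[rows] <- rep_len(1:K, length(rows))  # deal them evenly across folds
}

# Every fold preserves the 80/20 class balance (16 "A"s and 4 "B"s each)
table(fold, class)
```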

The metric that train() optimizes can also be set explicitly; for regression, caret expects the case-sensitive name "RMSE":

model <- train(y ~ ., data = df,
               method = "lm",
               metric = "RMSE",
               trControl = train_control)